2 Mainline results: 250k training epochs
2.1 Using naive threshold of 0.05
Criterion for model inclusion: is the “dropout rate parameter” \(\alpha\) lower than our Type I error threshold? This interprets \(\alpha\) as a probability, which is not well justified (\(\alpha\) is commonly > 1 for the nuisance variables).
2.1.1 Type II errors:
Total count of Type II errors and Type II error rate over the 200 simulated datasets:
## [1] 95
## [1] 0.11875
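The count and rate above come from a check along these lines (a sketch: `final_alphas` below is a made-up stand-in for the fitted \(\alpha\) matrix, with rows = simulated datasets and the 4 true covariates in the first columns):

```r
set.seed(1)
# Made-up stand-in for the fitted alpha matrix: 200 simulated datasets (rows)
# x 104 covariates (columns); the first 4 columns are the true covariates.
final_alphas <- matrix(runif(200 * 104, 0, 2), nrow = 200)

# Naive rule: keep a covariate when alpha < 0.05, so a true covariate whose
# alpha exceeds the threshold is a Type II error.
t2_count <- sum(final_alphas[, 1:4] > 0.05)
t2_rate  <- t2_count / (4 * 200)
```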
All errors come from the first covariate (\(\beta = 0.5\)). The table below shows \(\alpha\) for the first 4 covariates across all 200 simulated datasets.
2.2 Interpreting \(\alpha\) as posterior-based Wald statistic
A more principled approach might be to compare the dropout parameter \(\alpha\) against the inverse of a \(\chi^2\) distribution with 1 degree of freedom, with a Bonferroni correction. This is based on the idea that:
(to justify using the \(\chi^2(1)\) distribution) the \(\alpha\) parameter is the inverse of the posterior-based Wald statistic discussed in Liu, Li, Yu 2020 (referred to as LLY 2020);
(to justify Bonferroni) the mean-field assumption used in variational inference assumes independence between the individual \(\alpha\) parameters (\(\alpha_i = \dfrac{Var(\tilde{z_i})}{\left[ E(\tilde{z_i}) \right]^2}\)).
2.2.1 LLY 2020
LLY 2020 propose the posterior-based Wald statistic
\[W = (\bar\theta - \theta_0)'[V_{\theta \theta} (\bar\nu)]^{-1} (\bar\theta - \theta_0) \overset{d}{\rightarrow} \chi^2(q_\theta)\]
where
- \(\theta\) is the parameter(s) of interest (\(\bar\theta\) refers to the posterior mean),
- \(q_\theta\) is the dimension of \(\theta\),
- \(V_{\theta \theta} (\bar\nu)\) is the portion of the posterior covariance matrix relevant to \(\theta\) (\(\nu\) refers to all estimated parameters).
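As a concrete illustration, \(W\) can be computed as below (the values for \(\bar\theta\), \(\theta_0\), and \(V_{\theta\theta}\) are made up for the example, not taken from the simulations):

```r
# Illustrative computation of the posterior-based Wald statistic W.
theta_bar <- c(0.48, 0.52)          # posterior means of the parameters of interest
theta0    <- c(0, 0)                # hypothesized null values
V_tt      <- diag(c(0.010, 0.015))  # relevant block of the posterior covariance

W <- drop(t(theta_bar - theta0) %*% solve(V_tt) %*% (theta_bar - theta0))

# Under the null, W is approximately chi^2 with q_theta = 2 degrees of freedom,
# so compare against the upper-tail critical value:
W > qchisq(1 - 0.05, df = length(theta_bar))  # TRUE here (W ~ 41 vs ~ 5.99)
```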
2.2.2 Results for posterior-based Wald interpretation of \(\alpha\)
2.2.2.1 Type II error
Type II error count & rate:
wald_thresh <- 1 / qchisq(1 - (0.05 / 104), df = 1)
t2_sum <- sum(final_alphas[, 1:4] > wald_thresh)
t2_sum # count of Type II errors
## [1] 39
t2_sum / 800 # Type II error rate (4 true covariates x 200 simulations)
## [1] 0.04875
Results: 39 errors total out of a possible 800 (4 true covariates, 200 simulations), i.e. a Type II error rate of 0.04875.
2.2.3 Problem with posterior Wald interpretation?
Below is a histogram of the \(\alpha\) parameters for the 100 nuisance covariates, taken from the last training epoch of the 200 simulations, followed by a histogram of 1000 draws from a \(\chi^2(1)\).
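The two histograms can be produced along these lines (a sketch; `nuisance_alphas` here is simulated filler standing in for the collected last-epoch values):

```r
set.seed(42)
# Filler standing in for the flattened last-epoch nuisance alphas.
nuisance_alphas <- rlnorm(5000, meanlog = 0.5, sdlog = 0.4)

par(mfrow = c(2, 1))
hist(nuisance_alphas, breaks = 50,
     main = "Nuisance alphas, last training epoch", xlab = expression(alpha))
hist(rchisq(1000, df = 1), breaks = 50,
     main = "1000 draws from chi^2(1)", xlab = "x")
```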
The two do not match well. However, there are a few possible explanations:
- variational inference is known to underestimate variance (which would push the mode to the right);
- this may explain why there are NO values below 0.7: LLY 2020, Theorem 3.1, clarifies that the Bayesian \(W\) and the frequentist Wald statistic are not quite the same:
\[W = Wald + o_p(1) \overset{d}{\rightarrow} \chi^2(q_\theta)\]
- lastly, we are only looking at the final training epochs. These simulations simply set a maximum of 100k training epochs, so it's possible that the network should have trained for a longer period of time.
2.2.4 Notes / thoughts:
- any way to get ELBOs for competing models to approximate Bayes factors? Is this even useful / desirable? Avoiding this kind of computation is part of the reason we're using NNs in the first place…